Cloud computing for the era of brain observatories

Amazon Web Services Education: Research Seminar Series
September 30th, 2020

Ariel Rokem

Department of Psychology
eScience Institute

Follow along at: https://arokem.github.io/2020-09-30-AWS

Adam Richie-Halford

John Kruper

Jason Yeatman (Stanford)
Noah Simon (UW Biostats)
Eleftherios Garyfallidis (IU)



How do we relate brain to behavior?

The era of brain observatories

    Allen Institute for Brain Science

    Human Connectome Project (HCP), N = 1,200

    Adolescent Brain Cognitive Development (ABCD),
    N = 10,000

    Healthy Brain Network (HBN),  N = 10,000

    UK Biobank,   N = 500,000

Human Connectome Project extensions

Lifespan development

Brain aging and dementia

Alzheimer's subtypes

Anxiety and depression in teenagers

Amish Connectome (mental illness)

Low vision and blindness

Opportunities

New data sets will enable important new discoveries

Data-driven discovery

Data aggregation and integration

Machine learning and data mining

Data visualization and communication

Insights into the brain basis of complex behaviors

Personalized medicine

Outline

Data-driven discovery in human neuroscience

Studying brain connections with diffusion MRI

Automated tract segmentation and tractometry

Open-source software for diffusion MRI

Data-driven discovery in the cloud

Cloudknot: analysis at brain-observatory scale

Brain networks

Disconnectivity syndromes

Image from Catani and ffytche (2015)

Not just static cables!

Brain connections develop and mature with age

Individual differences account for differences in behaviour

Adapt and change with learning

Brain network health is important for mental health

Diffusion MRI measures the physical properties of brain connections

Diffusion MRI

Mean diffusivity

Diffusion MRI

Isotropic diffusion

Diffusion MRI

Anisotropic diffusion

Diffusion MRI

Mean diffusivity
Fractional anisotropy
Principal diffusion direction
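
A minimal sketch of how these three quantities follow from a single voxel's diffusion tensor, using the standard eigenvalue definitions. The function name and the example tensor values are assumptions made for illustration; in practice a library such as DIPY would fit the tensor and compute these maps.

```python
import numpy as np

def tensor_metrics(tensor):
    """Return (MD, FA, principal direction) for a symmetric 3x3 diffusion tensor."""
    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    evals, evecs = np.linalg.eigh(np.asarray(tensor, dtype=float))
    md = evals.mean()  # mean diffusivity: average of the three eigenvalues
    denom = np.sqrt(np.sum(evals ** 2))
    # FA is 0 for isotropic diffusion and approaches 1 for one dominant direction.
    fa = 0.0 if denom == 0 else np.sqrt(1.5 * np.sum((evals - md) ** 2)) / denom
    principal_direction = evecs[:, -1]  # eigenvector of the largest eigenvalue
    return md, fa, principal_direction

# Illustrative tensors (units of mm^2/s); the values are invented for the example.
isotropic = np.eye(3) * 0.7e-3
anisotropic = np.diag([1.7e-3, 0.3e-3, 0.3e-3])

md_iso, fa_iso, _ = tensor_metrics(isotropic)
md_aniso, fa_aniso, direction = tensor_metrics(anisotropic)
```

The isotropic tensor gives an FA of essentially zero, while the elongated tensor gives a high FA with its principal direction along the first axis.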

From diffusion to anatomy

The 3D structure of each brain is unique

Automated tract segmentation

Tractometry

The tracts are the coordinate frame for quantitative analysis
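
One way to read "coordinate frame": every streamline in a tract is resampled to the same number of nodes, and a scalar such as FA is averaged across streamlines at each node, giving one profile per tract. The sketch below is a simplified illustration under stated assumptions (function names are hypothetical, streamlines are already in voxel coordinates and consistently oriented, lookup is nearest-voxel, averaging is unweighted). Real tractometry pipelines are more careful on each of these points.

```python
import numpy as np

def resample_streamline(points, n_nodes):
    """Resample a polyline (n_points x 3) to n_nodes points equally spaced along its length."""
    points = np.asarray(points, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc_length = np.concatenate([[0.0], np.cumsum(segment_lengths)])
    targets = np.linspace(0.0, arc_length[-1], n_nodes)
    return np.column_stack(
        [np.interp(targets, arc_length, points[:, dim]) for dim in range(3)]
    )

def tract_profile(streamlines, scalar_volume, n_nodes=100):
    """Average a scalar map (e.g. FA) across streamlines at each of n_nodes positions."""
    samples = []
    for streamline in streamlines:
        nodes = resample_streamline(streamline, n_nodes)
        # Nearest-voxel lookup; assumes coordinates are already in voxel space.
        i, j, k = np.round(nodes).astype(int).T
        samples.append(scalar_volume[i, j, k])
    return np.mean(samples, axis=0)

# Synthetic volume whose value rises from 0 to 1 along the first axis.
volume = np.tile(np.linspace(0.0, 1.0, 10)[:, None, None], (1, 10, 10))
# Two straight, invented "streamlines" running along that axis.
streamlines = [
    np.array([[0, 2, 2], [9, 2, 2]]),
    np.array([[0, 5, 5], [4, 5, 5], [9, 5, 5]]),
]
profile = tract_profile(streamlines, volume, n_nodes=10)
```

Because node 0, node 1, and so on refer to the same relative position in every subject, profiles can be compared across individuals even though each brain's 3D structure is unique.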

Amyotrophic Lateral Sclerosis (ALS)

Neurodegenerative disease

Affects motor neurons

Etiology varies widely

Tractometry for group comparison

Tractometry for prediction

Patient/Control?

Accurately classifies patients/controls

Classification accuracy of 93% (+/- 2%)
AUC of 0.978 (+/- 0.01)
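
For readers less familiar with the two metrics, a small sketch of how each is computed from labels and classifier scores. The labels and scores are invented toy values, not the ALS data, and the helper names are hypothetical.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of subjects whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def roc_auc(y_true, scores):
    """Probability that a random patient scores above a random control (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Toy labels (1 = patient, 0 = control) and classifier scores.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])
predictions = (scores >= 0.5).astype(int)

toy_accuracy = accuracy(labels, predictions)  # 4 of 6 correct
toy_auc = roc_auc(labels, scores)             # 8 of 9 patient/control pairs ordered correctly
```

Accuracy depends on the chosen threshold (0.5 here), whereas AUC summarizes ranking quality across all thresholds.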

Open-source software for science

Python: an ecosystem for scientific computing

Free and open source

High-level interpreted language

Very wide adoption

Python in Astronomy (ADS)

Community-developed open-source software


Neuroimaging in Python

A community of practice

Many different projects
http://nipy.org

Open to users

Open to contributors

Distributed collaboration

Automated tract segmentation

https://autofq.org

Tractometry at brain observatory scale

AWS Batch

Pros:

  • Abstracts away infrastructure details
  • Dynamically provisions AWS resources based on requirements of user-submitted jobs
  • Allows scientists to run 100,000+ batch jobs

Cons:

  • AWS Web Console resists automation
  • Requires learning new terminology
  • Does not easily facilitate reproducibility

AWS Batch workflow

  • Build a Docker image (local machine)
  • Create an Amazon ECR repository for the image (web)
  • Push the built image to ECR (local machine)
  • Create IAM Roles, compute environment, job queue (web)
  • Create a job definition that uses the built image (web)
  • Submit jobs (web)
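
The last two web-console steps can also be scripted. The sketch below builds the request payloads for boto3's Batch client methods `register_job_definition` and `submit_job`. The helper names, job names, queue name, and image URI are placeholders assumed for illustration, and the actual AWS call is kept in a separate function so that nothing contacts AWS unless it is called.

```python
def job_definition_request(name, image_uri, vcpus=1, memory_mib=2048):
    """Keyword arguments for boto3's batch.register_job_definition."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image_uri,
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }

def submit_job_request(job_name, queue, job_definition, n_jobs):
    """Keyword arguments for boto3's batch.submit_job, as an array job when n_jobs > 1."""
    request = {"jobName": job_name, "jobQueue": queue, "jobDefinition": job_definition}
    if n_jobs > 1:
        request["arrayProperties"] = {"size": n_jobs}  # array jobs need a size of at least 2
    return request

def register_and_submit(definition, submission):
    """Send both requests to AWS Batch (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the request builders work without it

    batch = boto3.client("batch")
    batch.register_job_definition(**definition)
    return batch.submit_job(**submission)["jobId"]

# Placeholder names; substitute a real ECR image URI and job queue.
definition = job_definition_request(
    "pi-calc", "<account>.dkr.ecr.<region>.amazonaws.com/pi-calc:latest"
)
submission = submit_job_request("pi-calc-run", "pi-calc-queue", "pi-calc", n_jobs=1000)
```

Even scripted, this still leaves the Docker build, the ECR push, and the IAM and compute-environment setup to handle separately.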

Challenge

Reap the benefits of AWS Batch from the comfort of our Python env

Previous attempts

Other projects have sought to lower the AWS barrier to entry

  • PiCloud (2010), acquired by Dropbox in 2013
  • pyWren (2017), built on AWS Lambda and bound by its limits:
    • 5-minute maximum execution time
    • 1.5 GB of RAM
    • 512 MB of local storage
    • no root access

API

Pedagogical example: estimating π using Monte Carlo

Define the user-defined function (UDF).


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

N.B. we import prerequisites inside the UDF.


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

Instantiate a Knot, creating resources on AWS.


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

          knot = ck.Knot(name='pi-calc', func=monte_pi_count)

Submit jobs with the map() method.


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

          knot = ck.Knot(name='pi-calc', func=monte_pi_count)

          n_jobs, n_samples = 1000, 100000000
          import numpy as np
          args = np.ones(n_jobs, dtype=np.int32) * n_samples
          future = knot.map(args)

Summarize the status of submitted jobs.


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

          knot = ck.Knot(name='pi-calc', func=monte_pi_count)

          n_jobs, n_samples = 1000, 100000000
          import numpy as np
          args = np.ones(n_jobs, dtype=np.int32) * n_samples
          future = knot.map(args)

          knot.view_jobs()
          [out]: Job ID          Name           Status
                 ----------------------------------------
                 fcd2a14b...     pi-calc-0      PENDING

Query the result status.


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

          knot = ck.Knot(name='pi-calc', func=monte_pi_count)

          n_jobs, n_samples = 1000, 100000000
          import numpy as np
          args = np.ones(n_jobs, dtype=np.int32) * n_samples
          future = knot.map(args)

          knot.view_jobs()

          done_yet = future.done()

Retrieve the result.


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

          knot = ck.Knot(name='pi-calc', func=monte_pi_count)

          n_jobs, n_samples = 1000, 100000000
          import numpy as np
          args = np.ones(n_jobs, dtype=np.int32) * n_samples
          future = knot.map(args)

          knot.view_jobs()

          done_yet = future.done()
          res = future.result()

Or retrieve previously submitted results.


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

          knot = ck.Knot(name='pi-calc', func=monte_pi_count)

          n_jobs, n_samples = 1000, 100000000
          import numpy as np
          args = np.ones(n_jobs, dtype=np.int32) * n_samples
          future = knot.map(args)

          knot.view_jobs()

          done_yet = future.done()
          res = future.result()
          res = knot.jobs[-1].result()  # Equivalent to future.result()

Or add a callback to the final result


          import cloudknot as ck

          def monte_pi_count(n):
              import numpy as np
              x = np.random.rand(n)
              y = np.random.rand(n)
              return np.count_nonzero(x * x + y * y <= 1.0)

          knot = ck.Knot(name='pi-calc', func=monte_pi_count)

          n_jobs, n_samples = 1000, 100000000
          import numpy as np
          args = np.ones(n_jobs, dtype=np.int32) * n_samples
          future = knot.map(args)

          knot.view_jobs()

          done_yet = future.done()
          res = future.result()
          res = knot.jobs[-1].result()  # Equivalent to future.result()


          PI = 0.0
          def pi_from_future(future):
              global PI
              PI = 4.0 * np.sum(future.result()) / (n_samples * n_jobs)

          future.add_done_callback(pi_from_future)
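
A quick local sanity check of the arithmetic, with no AWS involved. This is a sketch that reuses the UDF from the slides with far fewer samples and replaces `knot.map()` with a plain loop, so each call stands in for one Batch job. The fraction of random points that land inside the quarter circle approximates π/4, hence the factor of 4.

```python
import numpy as np

def monte_pi_count(n):
    import numpy as np
    x = np.random.rand(n)
    y = np.random.rand(n)
    return np.count_nonzero(x * x + y * y <= 1.0)

# Much smaller than the 1000 jobs x 100,000,000 samples above, so it runs in well under a second.
n_jobs, n_samples = 10, 100_000
args = np.ones(n_jobs, dtype=np.int32) * n_samples

# A plain loop stands in for knot.map(); each call plays the role of one Batch job.
counts = [monte_pi_count(int(n)) for n in args]
pi_estimate = 4.0 * np.sum(counts) / (n_samples * n_jobs)
```

The estimate should land close to π; the error shrinks roughly as one over the square root of the total number of samples, which is why the cloud version uses so many.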
        

Design

Single Program


                    import cloudknot as ck

                    def awesome_func(...):
                        ...

                    knot = ck.Knot(func=awesome_func)




                
Cloudknot workflow

Multiple Data


                    import cloudknot as ck

                    def awesome_func(...):
                        ...

                    knot = ck.Knot(func=awesome_func)

                    ...

                    future = knot.map(args)
                
Cloudknot workflow

Analysis of MRI data using DIPY (Garyfallidis et al., 2014)

Brain extraction

Denoising

Tensor fitting

(see code)

Compared against Dask, Myria, and Spark using a previous benchmark study (Mehta et al., 2017).

Analysis of MRI data

Takeaway

  • Previous MRI benchmark was performed by a team of 4 graduate students and postdocs over 6 months.
  • Cloudknot implementation took Ariel one day
  • For 25 subjects, Cloudknot was 10-25% slower
  • Cloudknot favors workloads where development time is more important than execution time

Conclusion

  • Cloudknot favors workloads where development time matters more than execution time.
  • For many data science problems, this is an acceptable trade.
  • Simplified API makes cloud computing more accessible.

          import cloudknot

          knot = cloudknot.Knot()

          results = knot.map(sequence)

Additional examples

Freesurfer segmentation

Bundle extraction

Github repo: https://github.com/nrdg/cloudknot

Documentation: https://nrdg.github.io/cloudknot/index.html

We welcome issues and contributions!

Outline

Data-driven discovery in human neuroscience

Studying brain connections with diffusion MRI

Automated tract segmentation and tractometry

Applications

Amyotrophic Lateral Sclerosis: finding a needle in a haystack

Brain development: seeing the forest and the trees

Expanding access to data-driven research

Publishing data and results in exploratory data visualizations

Developing open-source software

Training researchers from diverse backgrounds

Amyotrophic Lateral Sclerosis (ALS)

Neurodegenerative disease

Affects motor neurons

Etiology varies widely

Tractometry for group comparison

Tractometry for prediction

Patient/Control?

ALS classification

Challenges

Requires a region of interest

Otherwise: it's like finding a needle in a haystack!

The accuracy/interpretability tradeoff

Diffusion MRI as a generalized linear model

The objective:

Find weights β that minimize the error between the estimated y and the true y

But in our case p (number of variables) >> n (number of subjects)

Many different possible solutions!
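
A small numerical illustration of why p >> n leaves the weights underdetermined. The sizes and random data are assumptions chosen for the example: two very different weight vectors fit the same data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_features = 5, 20  # p >> n; sizes chosen only for illustration
X = rng.normal(size=(n_subjects, n_features))
y = rng.normal(size=n_subjects)

# Minimum-norm least-squares solution.
beta_min_norm = np.linalg.pinv(X) @ y

# Any vector in the null space of X can be added without changing the fit.
_, _, vt = np.linalg.svd(X)
null_vector = vt[-1]  # X @ null_vector is (numerically) zero
beta_other = beta_min_norm + 10.0 * null_vector

residual_min_norm = np.linalg.norm(X @ beta_min_norm - y)
residual_other = np.linalg.norm(X @ beta_other - y)
```

Both residuals are zero up to floating-point error, so the data alone cannot choose between the two solutions; some additional constraint on β is needed.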

The accuracy/interpretability tradeoff

Regularization

Get the weights β that minimize the error and are as sparse as possible

For example: the Lasso (Tibshirani, 1996)
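
Written out in one common notation (an assumption, since the slide's equation did not survive in the text): X is the n × p matrix of features, y the vector of outcomes, and λ ≥ 0 sets how strongly sparsity is favored. For patient/control classification the squared-error term would be replaced by a logistic loss.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```

Larger values of λ drive more entries of β to exactly zero.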

The accuracy/interpretability tradeoff

Diffusion MRI data has group structure

The Group Lasso

Sparsity of groups of β, instead of individual β

Selects data by tracts

Sparse Group Lasso

Sparsity of groups of β and of individual β

Selects data by tract and sparsifies within each tract

The data is used to determine strength of group and individual β sparsity
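
A sketch of the shrinkage step at the heart of proximal-gradient solvers for the sparse group lasso, under one common parameterization of the penalty: λ·[α·‖β‖₁ + (1 − α)·Σ_g √p_g·‖β_g‖₂], with p_g the size of group g. The function names, group layout, and coefficient values are assumptions for illustration, not the implementation used for the results in this talk.

```python
import numpy as np

def soft_threshold(values, threshold):
    """Shrink each entry toward zero by threshold, zeroing the small ones."""
    return np.sign(values) * np.maximum(np.abs(values) - threshold, 0.0)

def sgl_prox(beta, groups, lam, alpha):
    """Proximal operator of lam * (alpha * L1 + (1 - alpha) * sum_g sqrt(p_g) * L2_g).

    groups: list of non-overlapping index arrays, one per tract.
    alpha=1 reduces to the lasso; alpha=0 reduces to the group lasso.
    """
    out = soft_threshold(np.asarray(beta, dtype=float), lam * alpha)
    for idx in groups:
        norm = np.linalg.norm(out[idx])
        weight = lam * (1.0 - alpha) * np.sqrt(len(idx))
        # Shrink the whole group toward zero; drop it entirely if its norm is small.
        scale = max(0.0, 1.0 - weight / norm) if norm > 0 else 0.0
        out[idx] = scale * out[idx]
    return out

# Two hypothetical "tracts" with three nodes each; coefficients invented for illustration.
beta = np.array([3.0, -2.0, 0.1, 0.3, -0.2, 0.1])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
shrunk = sgl_prox(beta, groups, lam=0.5, alpha=0.5)
```

In this toy case the second group is removed entirely (group sparsity), while the first group survives with its smallest coefficient set to zero (within-group sparsity).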

Nested cross-validation
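
A minimal sketch of the nested scheme: the inner loop chooses the regularization strength using training data only, and the outer loop scores that choice on data it never saw. Closed-form ridge regression stands in for the sparse group lasso here to keep the example short; the data are synthetic and all names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, penalty):
    """Closed-form ridge regression; a stand-in for the sparse group lasso."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

def mse(X, y, beta):
    return float(np.mean((X @ beta - y) ** 2))

def k_fold_indices(n_samples, n_folds, rng):
    """Yield (train, test) index arrays for shuffled k-fold splits."""
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

def nested_cv(X, y, penalties, n_outer=5, n_inner=3, seed=0):
    rng = np.random.default_rng(seed)
    outer_scores, chosen = [], []
    for train, test in k_fold_indices(len(y), n_outer, rng):
        # Inner loop: pick the penalty using the outer-training data only.
        inner_error = np.zeros(len(penalties))
        for inner_train, inner_val in k_fold_indices(len(train), n_inner, rng):
            tr, va = train[inner_train], train[inner_val]
            for p, penalty in enumerate(penalties):
                inner_error[p] += mse(X[va], y[va], ridge_fit(X[tr], y[tr], penalty))
        best = penalties[int(np.argmin(inner_error))]
        chosen.append(best)
        # Outer loop: refit on all outer-training data, score on the held-out fold.
        outer_scores.append(mse(X[test], y[test], ridge_fit(X[train], y[train], best)))
    return outer_scores, chosen

# Synthetic data, invented for the example.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)
scores, chosen_penalties = nested_cv(X, y, penalties=[0.01, 1.0, 100.0])
```

Because the held-out fold plays no part in choosing the penalty, the outer scores are not inflated by hyperparameter tuning.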

Accurately classifies patients/controls

Classification accuracy of 93% (+/- 2%)
AUC of 0.978 (+/- 0.01)

ALS correlates in white matter are localized

Cognitive development

How do adverse childhood events affect development?

How do they affect brain connectivity?

Brain development

And white matter?

Cortical thinning reflects myelination (Natu et al., 2019)

White matter tracts mature in waves over a long time (Lebel and Beaulieu, 2012)

Challenge: focusing on any particular location risks losing the full picture

Loses the forest for the trees

Brain age

Predictive model of age, based on brain features

The model was constructed based on a publicly available sample of 77 individuals
(Yeatman et al. 2014)

Chronological ages 6-50

Sparse group lasso

Brain age


MAE: 3.6 years, R² = 0.3
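
For reference, a sketch of how the two summary numbers are typically computed from true and predicted ages. The ages below are invented and are not the data behind the reported values. R² is taken here as the coefficient of determination, which is an assumption; some reports use the squared correlation instead.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented ages in years.
age = np.array([8.0, 12.0, 20.0, 35.0, 50.0])
predicted_age = np.array([10.0, 11.0, 24.0, 30.0, 45.0])

toy_mae = mean_absolute_error(age, predicted_age)
toy_r2 = r_squared(age, predicted_age)
```

MAE is in the units of the target (years), which makes it the easier of the two to interpret for a brain-age model.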

Multiple white matter correlates

In some tracts FA increases with age

Early in development

In some tracts FA increases with age

Late in development

In other tracts FA decreases with age

Sparse Group Lasso

Capitalizes on brain structure

Accurately classifies ALS patients

Serves as a basis for a brain age model

Identifies dense or sparse biological feature sets

Current and future work

Adverse childhood events (Adam Richie-Halford)

Acquisition of skilled reading and math (w/ Jason Yeatman, Stanford)

Baseline model of the Human Connectome (John Kruper)

Human Connectome Project extensions

Lifespan development

Brain aging and dementia

Alzheimer's subtypes

Anxiety and depression in teenagers

Amish Connectome (mental illness)

Low vision and blindness

Data-driven research

Building a data science community: open, rigorous and ethical

Development of tools and practices for reproducible research

Data science education

Results from large multi-dimensional datasets are hard to understand

Hard to communicate

Hard to reproduce

Solution: tools for exploration with data sharing built in!

A browser-based tool for visualization and analysis of diffusion MRI data

A web-based application

Leverages modern visualization frameworks

Builds a website for a diffusion MRI dataset

Automatically uploads the website to GitHub

https://yeatmanlab.github.io/Sarica_2017

Exploratory data analysis

Enhances published results

Linked visualizations facilitate easy exploration

Enables new discoveries in old datasets

Generates hypotheses for new research

Automatic data sharing

http://afqvault.org

Data sharing broadens access

MRI data analysis requires specific expertise

Tract segmentation and tractometry generate data in a tidy table format (CSV)

Facilitates interdisciplinary collaboration

Facilitates reproducibility
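
To illustrate why the tidy layout lowers the barrier for collaborators: with one row per subject, tract, and node, a summary needs nothing beyond the standard library. The column names and values below are assumptions invented for the example, not a real dataset or a guaranteed output schema.

```python
import csv
import io
from collections import defaultdict

# A few rows in a tidy layout; column names and values are invented for illustration.
csv_text = """subjectID,tractID,nodeID,dti_fa
sub-01,CST_L,0,0.52
sub-01,CST_L,1,0.55
sub-01,ARC_L,0,0.41
sub-02,CST_L,0,0.48
sub-02,CST_L,1,0.50
sub-02,ARC_L,0,0.44
"""

# Mean FA per tract, averaged over subjects and nodes.
values_by_tract = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    values_by_tract[row["tractID"]].append(float(row["dti_fa"]))

mean_fa = {tract: sum(v) / len(v) for tract, v in values_by_tract.items()}
```

A statistician can work with such a table without any MRI-specific tooling.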


How can we get more people involved?

Developing software requires expertise

Learning methods for data-driven research requires substantial hands-on experience

Hack weeks

Week-long events

Combination of learning and project work

Astro hack week

Geo hack week

Neuro hack week

A fine balance of pedagogy and hacking

NeuroHackademy: A Summer Institute in Neuroscience and Data Science

NeuroHackademy 2020 goes online!

https://neurohackademy.org/apply/

Thanks!

Adam Richie-Halford
John Kruper
Anisha Keshavan
Josh Smith
Jason Yeatman (Stanford)
Noah Simon (UW Biostats)
Eleftherios Garyfallidis (IU)



Contact information

http://arokem.org
arokem@gmail.com
@arokem
github.com/arokem

Automated detection of glaucoma with interpretable machine learning using clinical data and multi-modal retinal images


Parmita Mehta
(UW CSE)

Aaron Lee
(UW Ophthalmology)
Mehta, Petersen, Wen, Banitt, Chen, Bojikian, Egan, Lee, Balazinska, Lee, Rokem (2020)

Nested cross-validation
